Statistical Science 

2011. Vol. 26, No. 2, 271-287 

DOI: 10.1214/10-STS349 

© Institute of Mathematical Statistics, 2011 



Estimating Random Effects via 
Adjustment for Density Maximization 1 

Carl Morris and Ruoxi Tang 

Abstract. We develop and evaluate point and interval estimates for 

the random effects 6{, having made observations yi\6i ~* N[9i, Vi\,i = 
1, . . . , k that follow a two-level Normal hierarchical model. Fitting this 
model requires assessing the Level-2 variance A = Var(0j) to estimate 
shrinkages Bi = Vi/(Vi + A) toward a (possibly estimated) subspace, 
with Bi as the target because the conditional means and variances of Oi 
depend linearly on Bi, not on A. Adjustment for density maximization, 
ADM, can do the fitting for any smooth prior on A. Like the MLE, 
ADM bases inferences on two derivatives, but ADM can approximate 
with any Pearson family, with Beta distributions being appropriate 
because shrinkage factors satisfy < Bi < 1. 

Our emphasis is on frequency properties, which leads to adopting 
a uniform prior on A > 0, which then puts Stein's harmonic prior 
(SHP) on the k random effects. It is known for the "equal variances 
case" V\ = • ■ • = Vf. that formal Bayes procedures for this prior produce 
admissible minimax estimates of the random effects, and that the pos- 
terior variances are large enough to provide confidence intervals that 
meet their nominal coverages. Similar results are seen to hold for our 
approximating "ADM-SHP" procedure for equal variances and also for 
the unequal variances situations checked here. 

For shrinkage coefficient estimation, the ADM-SHP procedure allows 
an alternative frequency interpretation. Writing L{A) as the likelihood 
of Bi with i fixed, ADM-SHP estimates Bi as Bi = Vi/(Vi + A) with A = 
argmax(A * L(A)). This justifies the term "adjustment for likelihood 
maximization," ALM. 

Key words and phrases: Shrinkage, ADM, Normal multilevel model, 
Stein estimation, objective Bayes. 
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1. INTRODUCTION 

This concerns approximate frequentist, Bayesian, 
and objective Bayesian inferences for a widely ap- 
plied two-level Normal hierarchical model. At Level-1, 
for i = 1, . . . , k, unbiased estimates yi are observed 
with means Oi and with known variance Vi. In prac- 
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tice the {Vi} usually are unequal, perhaps with V{ = 
a 2 /rii and a 2 known or accurately estimated. Thus 

(1) yi \0i*& N[0i,Vi], i = l,...,k. 

In practice each Level- 1 value yi here represents 
a sufficient statistic or a summary unbiased estimate 
based on the nj observations taken from the ith of 
the k units (e.g., a hospital, a small area, or a teach- 
ing unit). 

Level-2 specifies a Normal model for the random 
effects 9i, each with its own r-dimensional predictor 
variables x% so that for fi and an unknown variance 
A> 0, 

(2) 9 i \fi,A i ^ d N[ tH = x^fi,A], i = l,...,k. 

The case r = corresponds to fi fully known and 
then it may be convenient to set fi = and \n = 0, 
WLoG. If r > 1, X = (x^, X2, • • • , as a known 
A; x r matrix, assumed to have full rank r. 

The marginal distribution of y = (y±, ... ,yk)', gi- 
ven /? and A, and the conditional distribution of 9i 
follow from the above, so that 

(3) y^A^N^Vi + A], i = l,...,k, 
OfouhAW N[(l - Bi) yi + B if H t Vi^ ~ B i)l 



where = x\fi, and Bi = V .^ A is a "shrinkage fac- 
tor." 

When r > 1, the vector fi is assumed throughout 
to follow Lebesgue's flat prior on [0,oo), so 

(5) p{fi, A) dfi dA oc elfin (A) dA. 

Using this flat prior density for fi is equivalent to re- 
stricted maximum likelihood (REML). When n(A) 
is proper, the posterior distribution for this prior is 
proper (it integrates finitely) if k >r. When tt(A) 
is improper, a larger k is needed, with k > r + 3 
sufficing for the main distributions it (A) of inter- 
est here. When r = 0, as assumed initially, or when 
r > 1 and with /3 integrated out, we can focus on the 
main issue of dealing with the (nuisance) variance 
component A = Var(#j) and how to make inferences 
about the shrinkages Bi. 

Widely used programs like HLM, ML3 and SAS 
use MLE/REML methods to fit this model, while 
software for fully Bayesian inferences is available via 
BUGS and MLwiN ((Rasbash et al., 2001)). Max- 
imum likelihood and REML obtain an estimate A 
that maximizes the likelihood function of A (or mar- 



ginal likelihood in the REML case). Asymptotically 
(k large), maximum likelihood provides optimal es- 
timates of A, leading to convergence of estimates 
via frequentist and Bayesian approaches. However, 
the standard errors assigned by MLE and REML 
methods to the random effect estimates and the 
corresponding interval estimates can lead to confi- 
dence intervals with much smaller than their nomi- 
nal confidences, even asymptotically. This happens 
with MLE and REML methods not only because A 
can be underestimated so that shrinkages are over- 
estimated, but also because these procedures do not 
account for the fact that A has been estimated. 

Maximum likelihood and REML estimates of A 
not infrequently produce A = 0, in which case shrink- 
age MLEs are B% = 1. Examples occur in every field, 
as for the 8 schools data ((Gelman et al., 2004)), and 
in small area estimation ((Bell, 1999)). Then, per 
typical usage, the variance estimates may be taken 
to be Vi(l — Bi) = when r = 0, leading to zero- 
width or overly narrow confidence intervals of 9i. 
As will be seen in Section 4, even when A > and 
this situation is avoided, overfitting via MLE and 
REML can be considerable and nominal 95% confi- 
dence intervals for 9i might have true coverages in 
the 50-80% range. 

The procedures developed here to fit the two-level 
model above offer computational ease comparable 
to maximum likelihood and REML methods, be- 
ing based on differentiating the (adjusted) likelihood 
function twice. When k is small or moderate, how- 
ever, the adjustment provides much better standard 
errors and interval coverages. "Better" coverage is 
meant in the Level-2 frequentist sense of averaging 
over the data and the Level-2 model (2), for all fi- 
xed fi, A, as illustrated in the equal variances case 
of Figure 6, Section 4. 

Central to this development is the ADM proce- 
dure, "adjustment for density maximization" ((Mor- 
ris, 1988b)), albeit not then with the ADM label. 
ADM can be used with any Pearson family (Normal, 
Gamma, Inverted Gamma, Beta, F, t or skew-i) to 
approximate another distribution with a one-dimen- 
sional density. One merely multiplies the density by 
an adjustment which is determined by the Pearson 
family, and then makes the argmax function produce 
the mean, not the mode, of the Pearson distribution. 
As seen in (4), posterior means and variances of the 
random effects are linear functions of the shrinkage 
factors Bi, not of A, so it is desirable to estimate 
the posterior mean of Bi, and not the mode of Bi 



ESTIMATING RANDOM EFFECTS 



3 



or the mean of A. Shrinkage factor distributions are 
skewed and lie in [0,1], both of which make a Beta 
distribution approximate better than a Normal. Fit- 
ting Beta distributions via ADM is described in Sec- 
tion 2.3. 

Estimating shrinkage factors via ADM will be seen 
to reduce to maximizing the posterior density of A 
(or the marginalized density, if necessary) , after hav- 
ing multiplied this density by A. This adjustment 
has several benefits, which include prevention of es- 
timating A as 0, and overestimating A by just enough 
to account for the convex dependence of Bi on A. 
ADM methods have been used successfully before to 
improve inferences of random effects in other multi- 
level models, as in Christiansen and Morris (1997) 
for a Poisson multilevel model. 

The main procedure here approximates a formal 
posterior distribution stemming from the flat prior 
tt(A) = 1 on A > in (5). This flat prior on A, in 
conjunction with (2), induces Stein's harmonic prior 
(SHP) (30) on the random effects ((Stein, 1981)) 
and a minimax admissible estimator. (Stein's prior 
on for k > 3, d6/\\6\\( k ~ 2 \ is harmonic except at 
the origin, so it actually is "superharmonic." The 
shorter term "harmonic" is used here for simplicity 
of discourse.) The ADM approximations are seen 
in Section 3 to approximate closely the exact poste- 
rior means and variances of the random effects. But- 
tressed with the examples of Sections 3 and 4, our 
assessments show, by frequency standards, so for all 
fixed hyperparameters A > and /3, that the ADM- 
SHP combination outperforms commonly used MLE 
and REML procedures for estimating the random 
effects (l)-(5). 

The ADM approximations of Section 2.7 apply to 
any smooth prior density it (A), including the scale- 
invariant prior densities tt(A) on A 

(6) it (A) dA oc A c ~ l dA, c> 0. 

These receive some specific attention, but our fre- 
quency evaluations are limited to the special choice 
in (6) of c = 1 for which A ~ Unif (0, oo). Stein's har- 
monic prior not only produces safe frequency pro- 
cedures for squared-error point estimation, but the 
posterior variances of 0i are large enough to serve 
as a basis for confidence intervals centered at the 
posterior means ((Stein, 1981); Morris 1983b, 1988a; 
(Christiansen and Morris, 1997)). Hierarchically, the 
uniform formal prior it {A) = 1 is suggested by the 
fact that the renowned James-Stein estimator is the 
posterior mean, exactly, if this flat prior is extended 



(inappropriately) to A ~ Unif [— V, oo) (Morris, 1977, 
1983b). 

Section 2 starts with the "equal variances case," 
Stein's setting ((James and Stein, 1961)) for which 
Vi = ■ ■ ■ = Vk(= V) . Although equal variances are 
unusual in practice, this situation provides a rich 
and meaningful structure that has been studied wide- 
ly because of its relative simplicity for mathemat- 
ical investigation. Among other advantages, when 
r > and the unknown means fa must be estimated, 
the equal variances situation allows easy recovery of 
risks and coverage probabilities merely by translat- 
ing these quantities from the simpler (k — /^-dimen- 
sional situation when shrinkages are toward known 
means fa = 0. Also with equal variances, ADM ap- 
proximations to Bayes rules are easily developed for 
the range of scale-invariant priors (6), merely by 
solving a quadratic equation for A. 

Section 2 continues by extending these ADM rules 
for the "unequal variance case" (the variances Vi dif- 
fer, as is common in practice). Section 2.8 introduces 
a new, more general approximation for the poste- 
rior means and variances, which allows any r > so 
that shrinkages can be toward an estimated regres- 
sion. With computational and programming meth- 
ods similar to those of REML, noticeably more ac- 
curate procedures emerge. 

Section 3 examines how well ADM methods ap- 
proximate the exact Bayes rule. These approxima- 
tions are good for small values of k and they be- 
come exact as k — > oo. Even the data analyst who 
insists on exact computations can find such approx- 
imations useful because of increased speed, even if 
only for doing preliminary analyses. 

For the case c = 1 when A is flat, Section 4 eval- 
uates the resulting ADM-SHP procedure's perfor- 
mance in repeated sampling for relative mean squa- 
red errors and for interval coverages. In the equal 
variances case, and in the unequal variance exam- 
ples considered, nominal coverages are achieved or 
exceeded for any k > r + 3. MLE and REML proce- 
dures cannot do this. 

2. ADJUSTMENT FOR DENSITY 
MAXIMIZATION 

This section starts by examining the inadequacy 
of MLE methods as a basis for inferences about 
shrinkage factors Bi and random effects, and why 
the ADM approach for shrinkage constants should 
be better. For most of this section r = 0, the dimen- 
sion of p, so that (3 and all fa = E(QA are assumed 
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known. Thus, the only unknown Level-2 (nuisance) 
parameter is A, the between groups variance that 
governs the shrinkage factors Bi = y^r& - With r = 0, 
(3) and (4) simplify slightly to 

yi\A~N(jH,Vi + A), 

(7) with shrinkage factor B; = — -, and 

v ; B Vi + A 

6i\ yi ,A ~ 7V((1 - Bi) Vi + Bun, Vi(l - Bi)). 

Let Si = (yi — fii) 2 ~ (Vi + A)xi independently. S ~ 
(Si, . . . , Sk)' is a (minimal, if all Vi differ) sufficient 
statistic for A > 0. Then A\ = Si — Vi for i = 1, . . . , k 
are independent unbiased estimates of A with 
Var(Ai) = 2(Vi + A) 2 . One could average these Ai, 
weighted by the reciprocal of these variances to es- 
timate A, iteratively until convergence, with a neg- 
ative estimate of A reset to 0. This produces Amle, 
the MLE of A ((Efron and Morris, 1975)). 

In the equal variances case, S + = Y^i=i &i ^ s com- 
plete and sufficient for A, S + ~ (V + A)x 2 .- Then 
A un b = \^2,Ai = ^- — V is unbiased for A. Of course, 
^4unb can be negative, and P(A un ^ < 0) = P(x\ < 
kB), where the equal shrinkages are B = V X A - Be- 
cause k exceeds the median of \% P(x\ — kB) > 1/2 
if B is near 1 so that A is near to zero. This inequal- 
ity holds for any k if A < ^ , in which case ^4 un b < 
and Amle = more often than not. This issue of 
^mle being zero or quite small has received theo- 
retical attention at least since Morris (1983b), and 
has been recognized for some time in practice ((Bell, 
1999)), because its occurrence is not rare. Still, the 
problem has yet to be sufficiently recognized so as to 
be avoided in practice, and avoided in widely used 
software. 

When r = the likelihood function is proportional 

to 

L (A)^^Jl(V l + A)- 1 /^ 

(8) 

This is positive at A = and decreasing near if the 
SiS are small enough to make the exponential term 
be nearly constant. Then is a local maximum and 
if ^mle = Fisher's information cannot be used to 
assess the variance of the MLE. Furthermore, when 
Imle = 0, the MLE of Vax(0i\y,A) = V(l - Bi) 



also is zero. An unwary data analyst who uses this 
for the width of a confidence interval would assert 
that 9i = Hi with arbitrarily high confidence. 

The left panel of Figure 1 illustrates a case when 
the logarithm of the posterior density of A, equiva- 
lently the log- likelihood log(Lo(^)) since A has a flat 
prior, cannot use Fisher's observed information to 
estimate the variance of A since Amle — 0, there 
is no stationary point, and the second derivative is 
not negative. The situation for these data is much 
improved by using ADM to arrive at the adjusted 
log-likelihood in the middle and right panels of Fig- 
ure 1. 

2.1 Comparing ADM and MLE Methods 

MLE methods, viewed from a Bayesian (poste- 
rior probability) perspective, amount to finding the 
posterior mode of a parameter's distribution and its 
variance (reciprocal of observed information) when 
the parameter has a flat prior distribution. Normal 
distributions are used to approximate the MLE's 
distribution based on two derivatives of the log-likeli- 
hood. That works well when the likelihood is ap- 
proximately Normal, for example, with large sam- 
ples, but it works poorly when likelihoods are quite 
non-Normal, as can happen when estimating shrink- 
age factors. 

Morris (1988b), on approximating posterior dis- 
tributions, showed how to fit any prespecified Pear- 
son family (Normal, Gamma, F, Beta, t, etc.) to 
a density (but also a likelihood function) by cal- 
culating two derivatives of the "adjusted" (poste- 
rior) density function. The adjustment, multiplying 
by the quadratic or linear function that generates 
the particular Pearson family, makes the maximizer 
approximate the mean of the parameter, and not its 
mode. For a nearly symmetric bell-shaped distribu- 
tion or likelihood, the Normal is the best Pearson 
approximation, the adjustment is a constant. Then 
the mode agrees with the mean and the MLE is the 
ADM. For skewed likelihoods, the statistician may 
be able to choose a better approximating Pearson 
family, for example, the Beta family for shrinkage 
factors. 

The following factors compare the ADM and its fit- 
ting process, perhaps starting with a flat prior on A, 
with that of the MLE. 

1 . Simplicity. An ADM fit is accomplished via a com- 
plexity level comparable to the MLE, that is, 
both require two derivatives. 
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Fig. 1. An equal variances example (Vi — 1) with the MLE on the boundary, S+ = 8,k = 10, r = 0. The left panel plots the 
log posterior density for A, which is the log-likelihood for a flat prior on A. The middle panel plots the log adjusted density 
against A, log(A* Lo(A)) in this case, Lo -likelihood function (see text and Section 2.4)- The right panel shows the log adjusted 
density versus a = log A, which looks more quadratic. 



2. Normality. If a Normal distribution is chosen for 
the matching Pearson family, the ADM approach 
agrees exactly with the MLE, and the variances 
in both cases are estimated by using Fisher's ob- 
served information. 

3. Asymptotics. No matter which Pearson distribu- 
tion is chosen, ADM provides the same asymp- 
totic inferences (for large k) as the MLE. This 
holds because each Pearson family has an asymp- 
totic Normal limit. 

4. Linear expectations. While various transforma- 
tions of a parameter can be considered for the 
MLE, ADM targets the mean. For example, shrin- 
kage factors Bi enter linearly in (4) , so we approx- 
imate their means and variances, not A or some 
other function of B{. 

5. Likelihoods? The ADM procedure could be ter- 
med ALM (Adjustment for Likelihood Maximiza- 
tion), to parallel with MLE language. ALM and 
MLE both work best when a version of the pa- 
rameter is chosen to represent vague prior infor- 
mation, giving a relatively flat prior. We will see 
that ADM-SHP amounts to maximizing not the 
likelihood of A, as the MLE does, but the like- 
lihood after adjustment via multiplication by A. 
Li and Lahiri (2010) proposed using "adjusted 
maximum likelihood estimator" that is identical 
to ADM if r = 0. They showed its advantages in 
small area estimation for estimating shrinkages 
and for constructing parametric bootstrap pre- 
diction intervals. 

6. Multivariate ADM? Adjustments for density max- 
imization agree with the MLE for approxima- 
tions via the Multivariate Normal. The paucity 



of non-Normal multivariate Pearson families re- 
stricts ADM's extensions of the MLE to univari- 
ate parameters. However, hybrid extensions are 
possible, and here we use a multivariate Normal 
to approximate the r-dimensional vector f3 and 
a Beta distribution for a shrinkage factor. 

Given a prior distribution on A > 0, say ir(A) dA 
(proper or not), and still with r = 0, knowledge of 

(9) Bi = E„[Bi\y] and Wi = Var^y) 

enables computation of two moments of 8i, which 
with r = (fii known) are 

(10) E[6 i \y] = (l-B i )y l + B ifM , 

(11) Var(%) = Vi(l - Bi) + Vi ( yi - . 

The second variance component in (11) often is 
not represented in MLE applications, understating 
variances and encouraging over confidence. 

2.2 How Maximum Likelihood Can Distort 
Shrinkage and Random Effects Inferences 

Each of the following issues can cause overassess- 
ment of the information in the data. This perfect 
storm can have serious consequences when k is small 
or moderate. 

1. Nonlinearity. The posterior means and variances 
of the random effects are linear in Bi, not in A. 
Bi{A) = Vi/(Vi + A) is a convex function of A, 
even if A were unbiased for A, one sees, Jensen's 
inequality, which states that Bj(i2L4|y]) > E[Bi\y], 
indicates that the plug-in shrinkage estimate would 
be biased too large. This is why the James-Stein 
estimator that shrinks according to Bjs = ^ k ^ V , 
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uses the k — 2 in its numerator, and not k (as in 
the MLE), and leads to smaller mean squared er- 
rors than when using MLE shrinkages B = ^ . 

2. Boundary limits. Normal approximations to Bi 
put positive probability outside the boundaries 
of the interval [0, 1]. 

3. Boundary pileup. While < Bi < 1 is guaranteed, 
even in the equal variances case V/(V + A un b) = 
jijr^ > 1 is possible. The MLE cannot exceed 1, but 

-Bmle = min(l, kV/S+) = 1 with positive proba- 
bility. This pileup happens despite there being no 
prior distribution on A, other than A = with 
certainty, that can allow £"[i?|y] = 1 for any ob- 
servation y. 

4. Skewness. Lq(A) tends to be right-skewed, sub- 
stantially when the modal value of A is small. 
Alternatively, choose a fixed i and replace A by 



Bi in the likelihood by substituting A 



l-B j 



Vi 



in Lq(A). The resulting likelihood function of Bi 
will be left-skewed. Approximating such a skewed 
likelihood by a symmetric (Normal) distribution 
overstates the magnitude of Bi. A Beta density 
better approximates an asymmetric likelihood. 

5. Zero variances. The MLE approach assesses 
Var(6>;|y,A) as being Vi(l - -E^mle)- When 
^4mle = 0, this approach in effect attributes per- 
fect certainty to A = and that B% = [ii- 

6. Variance components. Estimating the variance 
of 6i by plugging into V^(l — Bi) overlooks the 
variance component Vi = \ax{Bi\y) which would 
account for the uncertainty in A when estimat- 
ing Bi. Ignoring the term Vi(yi — Hi) 2 amounts to 
setting Vi = 0. 

All six of these biases produces over confidence. The 
unknown variance A is underestimated, shrinkage Bi 
is overestimated, and Var(Sj|y) is underestimated. 

2.3 ADM, Adapted to Beta Distributions 

The applications here require approximating the 
means and variances of the shrinkage factors Bi, 
< Bi < 1. Beta distributions are constrained to 
[0, 1] , so are the obvious approximating Pearson dis- 
tribution. Consider an exact Beta distribution for B 
with B ~ Beta(oi,ao) and density 



(12) f(B)dB 



T(ai)T(a ] 



B 



a\ — 1 , 



1 - B 



dB. 



T(ai + a 

Maximizing over B gives B = ^ + "-2 ' tlle m ° de 
(if ai,a,Q > 1), not the mean. The "adjustment" for 



the Beta distribution maximizes the product (B(l — 
B))f(B), giving B = ai + ao , the mean of the Beta(ai, 

ao) distribution. Maximizing a Beta density after 
multiplying by B(l — B) produces the mean, not 
the mode. 
Now let 

(13) £(B)=log{B(l-B)f(B)} 

(14) =a log5 + ailog(l-£). 

This is a concave function, maximized uniquely at 
a point interior to (0,1). We have £'(B) = a__SL = o 

at B = Then 



(15) -?(B)\b=6 = ?Z + 



ao ai + a 



B 2 {l-B) 2 B{l-B) 

Thus, given B and —£"(B) > allows one to re- 
cover a\ and ao via a\ + ao = —£"(B) ■ B(l — B) and 
ai = B(a\ + a ). 
If f(B) is a Beta(ai,ao) density, exactly, then 



E{B) = B, 



(16) 



v = Var(B) 



m-B) 

B(l-B) 



1 + B(l — B)(—£"(B)) 



If a density f(B) is not exactly Beta but it lies near 
to a Beta density, the ADM approach proceeds simi- 
larly, based on two derivatives of log(Z?(l — B)f(B)), 
and approximates E[B] = Bf(B)dB by B, the 
maximizer of this adjusted density. The variance 
Var(i?) is approximated by (16), starting with 



(17) 



£(B) = log{B(l-B)f(B)}. 



That is, ADM for a Beta approximation first finds 
B = argmax(£(i?)). Then it determines —£"(B) and 

uses that to approximate Vai(B) by — ~, B<y )~ B \„, A „ . 

^ v ' J 1+B(1-B){-1" (B)) 

This Beta distribution approximation to a density 
on [0,1] is exact if the original density is a Beta ex- 
actly, and it will be a good approximation if the 
match is close. Its asymptotic accuracy can be eval- 
uated favorably ((Morris, 1988b), with discussion). 

It is useful when fitting shrinkages Bi = Bi(A) to 
re-express the results just outlined in terms of A, or 
equivalently in terms of its logarithm a = log(^4), 
being sure to include the Jacobian in the posterior 
density. Instead of using derivatives of —£(B), the 
"invariant information" will be calculated, defined 
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by 
(18) 



inv.info = 



d 2 l{B) 



d{\og\t{B)} 2 



B=B 



The derivative d\ogit(B)/dB = dlog(j^) /dB 
(B(l — B)), which gives 



V 



d 2 l{B) 



(19) 



d{logit(S)} 2 
= B 2 {l-B) 2 £"{B) 

+ 5(1 -B)(l- 2B)£'(B). 
= 0, we have inv.info = (B(l 



B)f 



As £'(B) 
(-£"(B)). 

Thus, if f{B) is (nearly) a Beta density S~Beta(ai, 
ao), then E[B\ = = B with B = &rgmax(£(B)), 

and the (approximate) variance is 



Var(B) 



(20) 



B(l-B) 
a\ + a + 1 

(5(1 -B)f 



inv.info +5(1 - B) 

Use of this invariant information is especially valu- 
able because of the identity 

d 2 £{B) d 2 £(B(A)) 



(21) 



d{logit(B)} 2 



d{log(A)} 2 

d 2 £(B(A(a))) 
da 2 



This follows from d{logit(B)}=dlog(^) =—da with 
a = log(A). The invariant information is the nega- 
tive second derivative with respect to a of £2(0:), 
being the log density written function of a: 

d 2 £(B) 



inv.info 



(22) 



d{logit(B)} 2 
d 2 £(B(A)) 



B=B 



d(log(A))< 



d 2 £ 2 (a) 



A=A 



da 2 

Thus, inv.info agrees with Fisher's observed infor- 
mation, but only if the parameter is a = log(yl). 

2.4 ADM for Estimating Shrinkage Constants 

Now return to the Normal model with r = and 
likelihood function L${A). Suppose A > has a prior 
density tt(A), not necessarily proper, and consider 



the shrinkage coefficient for component i, l<i<k, 
Bi = v^+A - ^he posterior density for Bi, given y, 
is proportional to Lq(A)tt(A) dA = f{Bi) dBi, where 
A = Vi{\ - Bi) / Bi and dA = -VidBi/B 2 . Then 
f(Bi) = Lo(A)Tr(A)Vi/B 2 is proportional to the den- 
sity of B{. To apply ADM, define 

(23) £ (Bi) =logOB;(l -£;)/(£?;)) 

(24) =\og{A-K{A)L Q (A)) = l(A). 
Still thinking of A function of Bj, 

d£(A) _ dA d£(A) -Vi , 



dBi dBi dA 



The following theorem summarizes what has just 
been demonstrated about the ADM approximation 
by a Beta distribution for Bi = Vi/(Vi + A), starting 
with a posterior density on A that is proportional 
to L (A)ir(A). 

Theorem 1. Given a prior density tt(A) and 
a likelihood function Lq(A), the ADM procedure for 
a Beta distribution approximates the first two pos- 
terior moments of Bi as 



(26) 



E[Bi\y] = Bi 



Vi 



Vi + A' 



where A = argmax(£(^)) ; £(A) = log(Air(A)L Q (A)), 
and 



(27) 



Vi 



Var(fli|y) 



(5(1 -B,)) 2 
inv.info +5(1 - Bi) 



with inv.info = -£" (A) A 2 . 

Neither A nor the invariant information depends 
on i or on Vj. 

2.5 Priors for Good Frequency Performance 

Admissible rules, which are Bayes and extended 
Bayes rules (per the "fundamental theorem of deci- 
sion theory" ), can provide good frequency properties 
if they are based on priors that let the data speak. 
One way to do that restricts to scale invariant im- 
proper priors ft (A) dA = A c_1 dA, < c < 1. As dis- 
cussed earlier, given k, these priors with c > Cfc > 
(cfe < 1/2, but not too small) produce estimators 
of 8i whose posterior means are minimax estimators 
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for squared-error loss in the equal variance setting, 
so that for all vectors 6 (fixed), 

k 

(28) ££{(1 - B(S+)) yi - etf/v < k, 



■i=i 



(29) 



B(S+) = E 



V 



V + A 



S 4 



The choice c = 0, so ir(A)dA = dA/A, puts essen- 
tially all mass at A nearly 0, making B(S+) = 1 with 
certainty, no matter what the data say. This choice 
must be avoided, but sometimes it is not. As c in- 
creases, shrinkages B(S+) decrease. For c = 1 and 
for some smaller values, down to c/., minimax and 
admissible estimators result. 

Our preference A ~ Uniform(0, oo) is equivalent to 
Stein's harmonic prior, that is, for 6 G K fc , k > 3, the 
(improper) measure on 6 is seen to be d6/\\6\\( k ~ 2 \ 
This is the density of 6 if, independently for i = 
l,...,k, 0i\A~ N(0,A) and A ~ Unif [0, oo), as seen 
from 



(30) 



-l/2\\0\\ a /A 



dA 



OC 110 



|2-fc 



This prior with c = 1, that is, A ~ Unif[0,oo), is 
strongly suggested in the equal variance case by the 
fact that the James-Stein shrinkage constant B = 
is precisely the posterior mean E[ V ^_ A \S+] if A ~ 
Unif[— V, oo). Lopping off the impossible part where 
A < leads to 4~ Unif[0,oo) ((Morris, 1983a)). 
That the James-Stein estimator is asymptotically 
optimal for large \\9\\ further suggests its use, that 
is, choosing c= 1. Still in the equal variances case, 
some values of c < 1, for example c = 1/2, shrink 
harder, which lowers the summed mean squared er- 
ror if \\0\\ 2 is suspected not to be large. Experience 
with this flat prior on A has borne out its good fre- 
quency properties in a variety of situations, also in- 
cluding for unequal variances. Supporting evidence 
is given in Sections 3 and 4. 

2.6 Exact Moments for the Uniform Prior in the 
Equal Variances Case 

The exact posterior means and variances of B= V/ 
{V + A) for c = 1, A being uniform ((Morris, 1983a)), 
are as follows. Denote m = (/c — r — 2)/2, so m = (k — 
2)/2 when r = 0. If r > 0, the dimension of /3, then 
the one can shrink toward the r-dimensional fitted 
subspace determined by /3 = (X'X)~ 1 X'y. In the 
(k — r)-dimensional space orthogonal to the range 
of X, shrinkage is toward the 0-vector. We there- 



fore can focus on that k — r subspace with r = 
and k replacing k — r (or think of shrinkage as to- 
ward a known, fixed vector fi as here). Now with j/j ~ 
N(jm,V + A), let S + = £ti(y;-AO\and let T = 
S+/2V. The James-Stein estimate is Bjs = m/T = 
(k — r — 2)V/S+. Let M m (T) be the moment gen- 
erating function of a Beta(l,m) distribution at T, 
a confluent hypergeometric function ((Abramowitz 
and Stegun, 1964)), 



M m (T) 



(31) 



f exp[(l — B)T) dB r 
Jo 



= r(m + l)T- m exp(T)P( X | m < 2T). 
Then ((Morris, 1983a)), 

-Bcxact = E[B\S] 

m 



(32) 



r (l - l/M m (T)) 
-2)V 



S + 
;2 



p(xL + 2<s + /v) 

P( X j m <S + /V) 



(33) 



V exact = Vax(B\S) 
exact 

m 



•IS 



B 



1 



m ■ 



1 



B t 



m 



exact 



(34) 



With r = 0, it follows that 
f,ct,i = E[9i\y] 

— (1 -^exact ) Hi -^exact l^i j 

, Sexact.i = Var(^|y) 
(35) 

= V(l - -Bcxact) + 'Vc x act(yj ~ Mi) • 

The elegance of these formulas for the equal vari- 
ances case is striking. Unfortunately, this disappears 
in the unequal variances case that invariably arises 
in practice, which motivates the search for relatively 
simple alternatives to exact calculations. 

2.7 ADM for Shrinkages, Equal Variances Case 

Maximum likelihood estimates have optimal asym- 
ptotic properties, but the small and moderate sam- 
ple sizes (k) that arise in hierarchical modeling ap- 
plications may be too small for the MLE to perform 
well. The mode of A, or more relevantly of B, may 
be quite inadequate approximations to the posterior 
mean that corresponds to a flat prior that makes 




the likelihood agree with the posterior density. Fig- 
ure 1 provides a simple example for equal variances, 
scaled for a sample size k = 10 with shrinkage to- 
ward zero (r = 0) and a sufficient statistic S+ = 8. 
S + = 8 is the mode of a Xw distribution, and also is 
the largest value of S that makes the James-Stein 
shrinkage estimate Bj$ = 1. Likelihood graphs like 
this are not uncommon in practice, even when un- 
equal variances occur. The right-most panels, which 
have made an adjustment to the likelihood, make 
it possible for two derivatives to capture the distri- 
bution, whereas there is no hope of this with the 
unadjusted left panel. 

Figure 2 plots estimated shrinkages B against T = 
S + /2V , for values of k = 4, 10, 20, each panel show- 
ing three different estimation methods: the exact 
shrinkage estimate for the flat harmonic prior c = 1 , 
SHP (solid curve); the ADM approximation to the 
same prior (dotted); and the MLE =min(l,(m + 
1)/T) = min(l, k/S+). The MLE shrinks much more 
heavily than the other two methods when T (or S+) 
is small. The ADM shrinkage curves are fairly close 
to the exactly computed expected shrinkage in each 
case, but are slightly more conservative. 

When /3 is unknown so that r > 0, the marginal 
distribution of A is gotten by integrating /3 out of 
the joint posterior density of /3 and A (which is done 
in the next section, and extended to unequal vari- 
ances) . The marginal density is neatly written in this 
equal variances case in terms of the sum of squared 
residuals, S + = YliiVi ~ Vi) 2 an d y = Xj3 as 

P {A\y)^(V + A)-^-^ 

\tt(A). 



•exp< - 



2{V + A) 



For it (A) oc A c , the logarithm of the adjusted den- 
sity (multiplying by A) is 



(37) 



>(A\y) = clog A - (m + 1) log(V + A) 
TV 



V + A 1 

T = S + /2V. With no covariates, r = 0, this equation 
continues to hold with m= (k — 2)/2. 
Now, 

de 2 (a) 
da 

_ A dh 
~ A dA 

(38) =-{{m + l-c)A 2 

- (2c + T - m - l)VA - cV 2 ) 
/(V + A) 2 . 

The numerator of (38) is a convex quadratic func- 
tion of A (with m + 1 — c > 0) which is negative at 
A = 0. It therefore has two real roots, one negative 
and unacceptable. The positive root is the ADM es- 
timator A. Then, 

V 



B 



(39) 



V + A 



2(m - c + l) 



T + m + 1 + y/{T -m-l) 2 + 4cT 



Note that B is monotone decreasing in T and that 
B reaches its maximum, 1 — c/ (m + 1) < 1 at T = 0. 
Shrinkage is bounded away from 100% if c > 0, for 
example, if c = 1 and r = the maximum shrinkage 
is (k — 2)/k. These shrinkages B decrease as c in- 
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creases and as c— > in (39), B — > min((m + 1)/T, 1). 
Of course c = is not allowed because then the pos- 
terior guarantees 100% shrinkage, no matter what 
the data say. 

Define a = log( A) and a = log A Then for any c, 
the invariant information = inv.info satisfies 

d 2 l 



inv.info 



(40) 



da 2 

UC Lt 

m(l - Bf + B 2 + (1 



c)(l -2B). 



Matching the first and second derivatives of the two 
densities (i.e., of the adjusted density and of a Beta(ai, 
ao) density) gives 

inv.info inv.info 
a 1 = T -_, <*> = — g-, 

and this Beta distribution has variance 
.. B(l-B) 



(41) 



a + a\ + 1 



£ 2 (l-5) 



m(l - 5) 2 + (1 - c) + (2c - 1)J3 

When c = 1 , the ADM approximations in this equal 
variances case to the posterior moments of B = V/ 
(V + A) are 

2(k-r-2)V 



(42) B 

(43) v 



s + + w+ 7(^ 

B 2 (l-B) 2 
m{l - B) 2 + B' 



kV) 2 + 8S+V 



For the SHP case c = 1 in Figure 2, B is plotted as 
a function of T, showing that the ADM estimate of 
B shrinks slightly less than the exactly computed B, 
while it matches exactly at T = 0, and asymptotes 
to the exact value for large T. The MLE produces 
much larger shrinkages. 

Figure 3, as in Figure 2, also shows graphs for the 
SHP (c = 1) and with curves for m = 1,4,9 (e.g., 
if r = 0, then for k = 4, 10,20). It reveals that the 
ADM approximation to v corresponds well with the 
exact posterior variance of a shrinkage factor, each 
as a function of its own shrinkage B. In both cases 
the shrinkage B decreases monotonically as the suf- 
ficient statistic 5+ rises. Figure 3 shows ADM's ex- 
cellent ADM approximation of the exact variance, 
and that it becomes exact as T nears (where max- 
imal shrinkage in both cases is for B = m/(m + 1) = 
(k — r — 2)/(k — 2)). 
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Fig. 3. Plot of v versus its own B from two different meth- 
ods. The solid line is from the exact method, that is, formu- 
las (32) and (33), and the dotted line is from the approximate 
method, formulas (42) and (43). 

For any r > in this equal variance case, the pre- 
ceding estimates of the shrinkages and of their vari- 
ances provide the following estimates of the means 
and the variances of the random effects 0i in terms of 
the ADM approximations to the posterior moments 
B and [3 = (X'X)" 1 X'y: 



(44) 



(45) 



s 2 



m\y) 

{l-B) yi + Bx'S, 
Var(^|y) 

vii-fy + vteix'xyh 

+ v( yi -x'J) 2 . 



)B 



Note that sf depends on i by increasing proportion- 
ally to the squared residual, as one would expect 
because mis-estimation of B hardly matters when 
(yi — x^P) 2 is small. These results are seen most eas- 
ily by using a least squares regression predictor in 
the r-dimensional range space of X , and shrinking 
to in the (k — r)-dimensional orthogonal subspace. 
The extension to the unequal variance case, which 
is next, is more complicated. 

2.8 The Unequal Variances Case With 
Regression 

An ADM approach to fitting our general model 
starts by integrating out the {8i} to get, in matrix 
notation, 



(46) 



y\P,A~N k (Xp,Dv +A ) 



ESTIMATING RANDOM EFFECTS 
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where Dy+A = diag(l^ + ^4) is a k-hy-k diagonal ma- 
trix. With f3 having a flat prior on R r , standard 
calculations with (46) lead to 

(47) p A = E((3\y, A) = (X' Dy l +A X)~ l X' Dy l +A y. 

With A known, f3 A is at once both the posterior 
mean and the weighted least squares estimate of (3. 
The full distribution, given A, is 

(48) PlAy^NriPA^X'Dy^X)- 1 ). 

The objective is to make inferences about the vec- 
tor = (0i, ... ,9k) with conditional distribution 

0\/3, A,y~ N k ((I - B A )y + B A XP, 

(49) 

(I-B A )V). 

This is (4) in matrix notation, with / the k-hy-k iden- 
tity matrix, V=diag(Vi, ■ ■ • i Vk) and B A = diag(-Bj = 
Vi/(Vi + A)). Integrating out j3, with help from (47), 
it follows that 

9\A, y ~ N k ((I - B A )y + B A Xp A , 

(50) (/ - B A )V 

+ V ^B 1 J 2 P A B 1 J 2 V 1 / 2 ), 

where in (50) P A is a k x k projection matrix of 
rank r, 

(51) P A = D v l f A X(X'Dy 1 +A X)- 1 X'Dy 1 ll. 

When A has prior density element tt(A) dA, the 
posterior density of A, given y, follows: 

p(A\y)^\D v+A \- l / 2 \X'D v \ A X\- l l 2 



(52) 



'V+A' 
•exp(-±(y-X/3 A )' 

■Dy\ A (y-Xp A )). 

The logarithm of this adjusted posterior density, 
with a = log (A), is 

1(a) = \og(Av(A)) 

k 

-^log(V l + A) 

(53) 



■\og\X' Dy\ A X 



1 



-(y-Xp A )'Dy\ A (y-Xp A ). 

Denote a=argmax(/(a)), set A=exp(a), and define 
inv.info = —I" (a). Then the ADM approximation, 
with Bi = Vi/(Vi + A), is Bi = V V ^ A ~ Beta with ap- 
proximate mean E(Bj) = B- t = Vi/(Vi + A) and vari- 



ance «i = Var(Bi) = {Bi(l- Bi)} 2 / {inv.info +Bi(l- 
B{)}, both moments depending on the prior n(A). 
Maximizing £(a) and determining its second deriva- 
tive at a, the negative of the invariant information, 
can be done by numerical methods, by Newton's 
method (which requires matrix derivatives), or by 
other means that include an EM technique available 
in Tang (2002). 

Given A and the values {Bi, Vi},i = 1, . . . , k, one 
could insert A into (50) to estimate both posterior 
moments of the However, that underestimates 
the variance and makes no use of the {v{\, so we 
proceed as follows, leading to a main theorem. 

Define /3 as fi A evaluated at A and y = X(3. Then 
from (50), and approximating (3 A by f3, 



(54) E(9i\A,y) 

(55) Vax(E(9i\A,y)) 



yi-Bi(yi - 
Vi(yi -Vif- 



To minimize complications in making our final ap- 
proximations to E(9i\y) and Var(6>j|y), we neglect 
variations of f3 A in (47) and P A in (51) as A varies 
around A. This is exact in the equal variances case 
because both (3 A and P A do not depend on A, and 
it will be nearly true if the {V^}, z = l,...,k differ 
only slightly. With unequal variances both f5 A and 
(51) involve weights that depend on {Vi +A}. If A 

is near A, as happens when k is large, then y.+ A is 
near 1. With data, one can evaluate 



(56) Var 



Vi + A 
Vi + A 



VarfS 
\Bi 



B 



These variances may be acceptably small, and Vi/B, 
diminishes as 1/k as k — > oo. 



Theorem 2. Assume the model (1), (2), and the 
prior in (5). Write Bi and v- t as the ADM approx- 
imations to E(Bi\y) = E( y V J rA \y) and to \ai(Bi\y). 

Assume E(f3 A \y) = f3 = f3 A and E(P A \y) 
for i = l,...,k 



Pi. Then 



(57) 
(58) 



E(9i\y) 
Vax(9i\y) 



(1 
(1 



Bi)yi + Bix'i/3 i 
(l- Pi ,i)Bi)Vi 



+ Vi(yi-yi) 2 . 



Here pi^ is the ith diagonal term in P A . 

Proof. Equation (57) follows from (50), (54) 
and E(f3 A \y) = (3, since 

E(0i\y) = E{(l-Bi)y i + B l y i }\y. 
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Now use EVE's law (total variation) to get, from (50) 
and (55), 

(59) Vax(p i \y) = EVax(e i \A,y)+Vax(EO i \A,y) 
= E{{l-B i )V i +B iPi , j y i \\y 

(60) 

+ Vi{yi -in) 2 , 

which is (58). 

In our experience, these regression approximations 
when tt(A) = A c_1 , and c = 1 especially, have been 
quite satisfactory. Tang (2002) provides a basis for 
making more precise approximations to E{Pa\v) and 

to E(/3a_\v) based on matrix and determinant deriva- 
tives. In the equal variance case, the theorem's two 
moments are exact provided exact formulas for 
E(Bi\y) and Var(Bi\y) are used. However, Normal- 
ity of 9i\A,y does not hold exactly for 9{ after av- 
eraging over A\y, although that Normal approxima- 
tion is commonly made. □ 

3. APPROXIMATION ACCURACY 

3.1 Approximation Accuracy of Shrinkages and 
the Random Effects 

Figures 2 and 3 show in the equal variance setting 
that even for small samples like k = 4, 10, 20, the 
ADM approximation of the first two exactly com- 
puted posterior moments of B is quite good. Our 
end goal, however, is verifying this leads to good ap- 
proximations of the posterior means and variances 
of each random effect (9i, i = 1, . . . , k). 

First, in the equal variance situation with r = 0, 
we compare the weighted average of posterior mean 
squared error of the 9i values via the ADM approxi- 
mation with this measure with the "exact" posterior 
mean. Let us measure the difference of their mean 
squared errors, given the data y, by computing 



(61) 




\y 



for the ADM approximation, with the expectation 
calculated exactly, when ir(A) = 1. Now 
( k \ 



\y 



i=l 



.2 

3 e,i) 



where the subscript e denotes estimates done exactly 
(see Section 2.6), with s 2 ei is given in (35). Therefore 



(62) 



s 2 

=1 °e,i 



i=l 



i=l 



measures how well the ADM approximation works 
for random effects estimates, smaller values indicat- 
ing better approximations. The highest (worst) ra- 
tio is 1.1% which occurs for k near 20, and for 60% 
shrinkage. Greater accuracy holds for k < 20 and for 
k > 20. Thus, in the equal variances setting, the con- 
ditional mean squared errors of the ADM approxi- 
mation and the exact estimator of 9 never differ by 
more than 1.1%. 

Now, still with tt(A) = 1, consider the unequal 
variance case and ADM's accuracy for approximat- 
ing the exact Bayes estimator of 9. The following ex- 
ample involves two groups of variances for the yi val- 
ues, and estimates the unknown mean vector /ii = 
• ■ • = Hio in the second level (so k = 10, r = 1). Five 
"small" variances are set at V\ = V2 = - ■ - = ^=0.55, 
and five "large" ones at Vq = - ■ - = Vio = 5.5. Their ma- 
ximum-to-minimum variance ratio is a factor of 10, 
and their harmonic mean is 1.0 (for convenience 
only) . Shrinkages B\ = ■ • ■ = Bq < Bq = ■ • ■ = Biq 
are toward the nine-dimensional subspace orthogo- 
nal to the unit vector. We calculated exact and ADM 
means and variances of these shrinkages, which de- 
pend on the separate values of the two-dimensional 
statistic T\,T2 (these two sums of squares are stan- 
dardized by their respective 2Vi, each summed over 
its respective subgroup of size 5, both centered on 
their common fitted grand mean). 

Figure 4 concerns shrinkages for the first five com- 
ponents with small variances, Vj = 0.55, and Fig- 
ure 5 shows shrinkages for the five components with 
large variances, Vj = 5.5. The left panels of each 
figure show shrinkage factor patterns for three dif- 
ferent rules: the MLE (dashed curve), the exactly 
computed shrinkage using the harmonic prior for 
which A has a flat density (solid curve), and the 
ADM approximations to that shrinkage factor (dot- 
ted curve). These are graphed as a function of T\ 
(Figure 4) and T2 (Figure 5) with separate displays, 
each conditional on one of four different values of 
the opposite Tj. 

Both figures show that the MLE has quite large 
shrinkages, just as for equal variances. The relation- 
ship between the ADM approximation and the ex- 
actly computed expected shrinkage that the ADM 




T2 = 5.5 
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Fig. 4. Approximation accuracy for two groups of variances, here for the small variance group (k = 10, r — 1, 
Vl = • ■ ■ = Vb = 0.55, Vfj = ■ ■ ■ = Vio = 5.5). The left-hand side plots Bi against T±, with T2 fixed at various values (which 
correspond to A = 0, 0.55, 1, 5.5). The right-hand side plots Var(Bi |data) against B\. Solid line is from the exact method, 
dotted line from ADM approximation, long dashed line is MLE. 



approximates is similar to what was seen in the 
equal variance case. The right-hand panels of each 
figure show good agreement between the ADM vari- 
ance approximation and the exactly computed vari- 
ances Vi when each is plotted against its own shrink- 
age. The maximum shrinkages for ADM and the ex- 
act rule are limited to values < 1, curtailing the hor- 
izontal axes for plots of Var(i?j). 

To summarize for the prior it {A) = 1, the ADM 
approximations of exact shrinkage factors for pos- 
terior means and variances of shrinkage factors are 



slightly conservative, but generally are in good agree- 
ment with the exact values obtained in the equal 
variance case. Similar results hold for the unequal 
variance case when variances Vi differ by a factor of 
10 and when r = 1. 

4. COVERAGE PROBABILITIES AND RISK 
FUNCTIONS 

Confidence interval coverage rates for 0i are eval- 
uated next for the two main procedures of Section 2, 
both based on assuming A > has a flat prior n(A) = 
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Ti - 1 
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Fig. 5. Approximation accuracy for two groups of variances, here for the large variance group (k = 10, r = 1, 
Vi = •■■ = 14 = 0.55, Vq = ■•• = Vio = 5.5). The left-hand side plots B2 against T2, with Ti fixed at various values (which 
correspond to A — 0, 0.55, 1, 5.5). The right-hand side plots Var(B2|data) against Ti. Solid line is from the exact method, 
dotted line from ADM approximation, long dashed line is MLE. 



1 so that the posterior density is the likelihood func- 
tion. One procedure, labeled "exact" here, evaluates 
the exactly computed posterior means and variances 
of 9i, given y, as in (34) and (35) for the equal vari- 
ances case, and otherwise by numerical integration. 
It then assigns a Normal distribution with these 
two moments to determine a posterior interval. The 
second approach uses Normal distributions in the 
same way, but centered and scaled via the ADM 
approximations of these two moments in (44) and 
(45), or when r > and with unequal variances, as 



in (57) and (58). Normal distributions are not ex- 
act for 8i, since the actual distributions are skewed 
(right-skewed for relatively large yi, and left-skewed 
for small yi). This matters less in repeated sampling 
evaluations that randomize over y, making skew- 
nesses average to zero for each i. 

For all i = 1, . . . , k, we seek two-tailed frequency 
coverage probabilities as a function of A: 



(63) 



Pr 



< (z*) 2 \A 



ESTIMATING RANDOM EFFECTS 
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Fig. 6. P/ot o/ coverage probabilities of Oi (random effects) for equal variances without regression, each against the true 
shrinkage factor. In the three graphs (k = 4, 10, 20), both the "exact" Bayes (solid curve) and its ADM-SHP approximation 
(dotted curve) achieve approximately the nominal 0.95 coverage rates (as indicated by the bold dashed horizontal line), or 
higher. The MLE (long dashes) can be markedly nonconservative, especially with large true shrinkages B (A near 0). As A 
approaches 0, MLE coverages fall below 50%, however large k might be. 



when the nominal coverage is 95%, so z* = 1.96. 
Each procedure studied uses its own estimate s? of 
the conditional variance of 0{. A related measure 
directly assesses how well each sf envelops the ex- 
pected squared error, given A, with values < 1 indi- 
cating that Si assigns sufficiently large intervals: 



(64) 



E^-kf/s^A}. 



Details of the simulation are in Tang (2002), where 
Rao-Blackwellization increased the accuracy by eval- 
uating some conditional Normal distributions ex- 
actly, given A and y. That is, for (64), 



Pr 



(0i - Bjf 

8? 



<{z*Y\A 



E 



E 



Pr 



<(z*) 2 \y,P,A 



A 



(1 - Bj) yi - Bjx'jP + z* Sl 
y/Vi(l -Bi) 
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4.1 Equal Variances Example 
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Figure 6 plots the actual coverage probabilities 
for the three confidence interval procedures, each 
against the possible "true" B values, for three equal 



variance procedures always with r = 0, and for k = 4 
( m = 1), k= 10 (m = 4) and k = 20 (m = 9). For 
each B = 0.005, 0.015, . . . , 0.995, 1000 data sets were 
generated and the interval procedures for "exact," 
its ADM approximation, and the MLE were evalu- 
ated and averaged to estimate the coverage proba- 
bilities. Confidence intervals for the MLE were de- 
termined simply by taking each variance to be the 
MLE V(l - B). These MLE coverages are plotted 
with long dashes in Figure 6. When shrinkage B is 
large, these MLE intervals give poor coverages, ul- 
timately dropping to just under 50%, as shown in 
Section 2. 

The graph of Figure 6 is redone in the first row of 
Figure 7, but without the MLE. That allows an am- 
plified scale that shows the slight differences in cov- 
erage rates between the "exact" rule and its ADM- 
SHP approximation. The ADM-SHP coverages meet 
or exceed > 0.95 for all A (within simulation error). 
The "exact" procedure's coverages can be slightly 
nonconservative, but its lowest coverage is at least 
0.945 (when k = 20 and B = 0.4) for all k shown. 
The ADM-SHP intervals achieve (or exceed) their 
nominal 0.95 coverage rates by having slightly wider 
intervals than "exact," due to ADM's reduced shrink- 
age estimates and its larger variance estimates v, 
as studied in Section 3. As B increases both meth- 
ods become quite conservative, with coverages well 
above 0.95. 
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k=4, r=0 k=10, r=0 k=20, r=0 
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Fig. 7. Plot of coverage probabilities and standardized risk functions for equal variances without regression. The first row 
plots coverage probabilities against true values of B = V/(V + A) on a larger scale than in Figure 6, with ADM-SHP coverages 
being the dotted curves. The second row plots the expected value of the loss function calibrated by s\ (64). 



The bottom row of Figure 7 plots the function (64) 
against B to compare the two different methods. 
Values less than 1.0 indicate that the estimated vari- 
ances sf average to as much as or more than the 
average mean square. This further suggests that the 
interval coverages will (nearly) provide the nominal 
coverage (95%) for all values of A > 0. 

4.2 An Unequal Variances Example: Two Groups 
of Variances 

We return to the unequal variances example of 
Section 3 with k = 10, r = 1, V\ = • ■ ■ = V 5 = 0.55, 
and Vq = ■ ■ ■ = Vio = 5.5. For this simulation, 100 
data sets were generated for each of 50 values -Bo = 
0.01, 0.03, . . . ,0.99, where B = Vq/{Vq + A) and V = 
1 is the harmonic mean of the V{. Nominal 95% 
confidence intervals for each 9i were evaluated for 
each data set. The confidence rates and average cal- 
ibrated losses (64) then were averaged over the sim- 
ulated values. 



Figure 8 plots coverages of the ADM-SHP inter- 
vals and calibrated risk functions (64) for 6\ and 
for #io as Bq = 1/(1 + A) varies. The upper left 
panel of Figure 8 plots the coverage probabilities 
against Bq for the group of five with small variances 
Vi = 0.55, and the upper right for the remaining 
group of five with large variances Vi = 5.50. As Bq 
increases and A decreases, coverage rates generally 
increase. Coverages achieve or exceed their nomi- 
nal 0.95 levels (within simulation error), while for 
small A and big Bq, coverages for the large vari- 
ance group substantially exceed both their nominal 
rate and the coverages for the small variance group. 
The calibrated risks are less than 1.0 in Figure 8 
which show that the intervals are wide enough to 
be conservative, although they may be excessively 
conservative for the large variance group. One rem- 
edy could be using the scale-invariant prior c = 0.5, 
which makes y/A flat. Coverages rates for the exact 
version of SHP were not evaluated for this unequal 
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Fig. 8. Plot of ADM-SHP coverages and expected value 
of average calibrated losses against Bo = 1/(1 + A). Here 
fc=10,r = l ) ft = • • • = Vi = 0.55, V 6 = --- = V W = 5.5. 



variance case, and that can be time-consuming for 
repeated sampling. Simple and fast computing, plus 
a procedure's transparency, are reasons for finding 
simple and accurate approximations. 

5. CONCLUSIONS 

Why might a Bayesian or objective Bayesian sta- 
tistician who has settled on prior distribution tt(A) 
on A consider approximating with ADM? There are 
several reasons, beyond the general observation that 
any procedure used in an application is an approxi- 
mation. 

1. Speed of convergence is valuable with big data 
sets, especially if a procedure is to be used re- 
peatedly for model selection and model checking. 
The approximations here avoid MCMC burn-ins. 
Speed also makes it feasible to simulate many 
times, for example, for bootstrapping, or to check 
a procedure's operating characteristics. 

2. Data analysts may need to obtain the same re- 
sults each time a particular model is re-fit to the 
same data, which stochastic approximations do 
not do. 

3. MLE methods always will play a central role in 
statistics. For the model of this paper, ADM main- 
tains the spirit of MLE while making small sam- 
ple improvements. 

4. Using ADM to help fit shrinkage factors extends 
to multilevel generalized linear models, for ex- 
ample, to fit a Poisson model ((Christiansen and 



Morris, 1997)). In such more complicated non- 
Normal models, MCMC and exact numerical in- 
tegration may be more difficult or impossible, 
giving MLE and ADM a greater advantage of 
ease. Then the frequency properties of ADM can 
be checked with each data application by simu- 
lating or bootstrapping from the fitted multilevel 
model. However, that will not reveal how well 
ADM approximates the exact Bayes procedure. 

5. Multiplying a likelihood by A before maximiz- 
ing combines neatly with EM methods as used 
to find the MLE of A ((Dempster, Laird and Ru- 
bin, 1977)). With ADM, EM would avoid infinite 
loops that occur when the MLE A = 0. 

6. Data analysts always will need well-checked, pre- 
packaged, documented, widely known and avail- 
able procedures for fitting models. 

7. Statistical software programmers should find it 
easy to program and adopt the ADM-SHP for- 
mulas, for example, the formulas of Section 2.8, 
in standard software. For example, ADM could 
be an option in SAS PROC MIXED along with 
MLE and REML. 

Barring prior information that A is likely to be 
small, the ADM-SHP methods developed here for ma- 
king inferences, especially interval estimates, about 
the random effects in a two-level Normal regression 
model will have better frequency performance over 
the entire range of A > than MLE and REML 
methods. Our derivation has benefited from view- 
ing Stein's harmonic prior SHP on the random ef- 
fects 9i as arising from a uniform mixture over A of 
the Level-2 Normal distribution (2), that is, accord- 
ing to tt(A) = 1. 

With this formal (improper) prior, the posterior 
density on A agrees with the marginalized likeli- 
hood function L(A). That justifies the term "adjust- 
ment for likelihood maximization" when "ALM" is 
restricted to point estimation of a shrinkage factor. 
The results here go on to use the flat n(A) = 1 prior 
and conditional (Bayesian) reasoning as a guide to 
accounting for variability of the shrinkage factors Bi 
and ultimately, of the random effects f9j. ADM ap- 
proximates the exact Bayes procedures with con- 
siderable accuracy, given that it retains the (rela- 
tive) ease of MLE/REML calculations, that is, by 
using two derivatives of the adjusted log-likelihood 
\og{A L(A)). Of course the adjustment here more 
generally would adjust by using the multiplier n(A) 
if tt(A) 7^ 1. While more testing is needed for un- 
equal variances cases, the confidence intervals for 
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random effects arising from the ADM-SHP combi- 
nation here thus far have met or exceeded their nom- 
inal coverages if k — r > 3. Still, the search should 
continue for priors on A that will provide even bet- 
ter frequency interval coverages. 
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